鐵人賽第二十四天~KNN實作

2023 iThome 鐵人賽

DAY 24

AI & Data

打造數據科學之路：資料分析與機器學習的完整指南系列第 24 篇

15th鐵人賽

smart1225

2023-10-09 18:57:11

884 瀏覽

分享至

~今天要分享的是「KNN實作」~

理解了KNN的核心概念後來看看在Python中是如何撰寫程式碼來實作的吧~

KNN分析模型在sklearn的neighbors套件底下：
from sklearn import neighbors

#使用在分類問題
neighbors.KNeighborsClassifier()

#使用在迴歸問題
neighbors.KNeighborsRegressor()

KNN在計算樣本間的距離時，可以選擇使用歐式距離法或曼哈頓距離法，在程式碼中可以透過KNN模型裡的參數”p”或”metric”來做設定。
歐式距離法：也稱為直線距離法，計算公式為√((X2-X1)^2+(Y2-Y1)^2)，在KNN模型中可以將p設定為2或將metric設定為'minkowski'來使用歐式距離法。

曼哈頓距離法：也稱為城市街區距離法，計算公式為|X2-X1|+|Y2-Y1|，在KNN模型中可以將p設定為1或將metric設定為"manhattan"或"cityblock"來使用曼哈頓距離法。

[程式碼實作]
迴歸問題：使用sklearn的資料集”boston”進行分析，n_neighbors設定為3

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()

boston_data=pd.DataFrame(boston['data'],columns=boston['feature_names'])
print("DATA：",boston_data.head())
print("=======================================")

boston_target=pd.DataFrame(boston['target'],columns=['target'])
print("TARGET：",boston_target.head())
print("=======================================")

X=boston_data[["CRIM",'ZN',"INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT"]]
y=boston_target['target'] 
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)

from sklearn import neighbors
KNNR=neighbors.KNeighborsRegressor(n_neighbors=3,p=2)
KNNR.fit(X_train,y_train)

print("解釋力：",KNNR.score(X_test,y_test))
from sklearn.metrics import mean_squared_error
from math import sqrt
KNNR_pred =KNNR.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, KNNR_pred))
print("rmse： %.3f" % rmse) 
print("依變數平均值：",boston_target['target'].mean())

由結果可以得知，此模型的解釋力約為0.68，而目標變數裡的數據平均值約為22.53，模型誤差約為4.92，所以算是預測能力偏普通但可以參考的模型。

分類問題：使用sklearn的資料集”iris”進行分析，n_neighbors設定為5

import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()

iris_data = pd.DataFrame(iris['data'],columns=iris['feature_names'])
print("DATA：",iris_data.head())
print("=======================================")

iris_target=pd.DataFrame(iris['target'],columns=['target'])
print("TARGET：",iris_target.head())
print("=======================================")

X=iris_data[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]
y=iris_target['target'] 
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3)

from sklearn import neighbors
KNNC=neighbors.KNeighborsClassifier(n_neighbors=5,p=2)
KNNC.fit(X_train,y_train)

from sklearn.metrics import classification_report,confusion_matrix
KNNC_pred= KNNC.predict(X_test)
print("混淆矩陣：",confusion_matrix(y_test,KNNC_pred)))
print("======================================================\n")
print("模型驗證指標：",classification_report(y_test,KNNC_pred)))

由結果可以得知，測試資料中有一筆分類為第1類的樣本被模型預測成第2類，因此出現了一點誤差，但整體正確率高達0.98，所以是一個蠻優秀的模型。